[fix] Using a pure Go parser for Python by linzhp · Pull Request #2320 · bazel-contrib/rules_python

linzhp · 2024-10-21T20:18:12Z

Replacing tree-sitting with github.com/go-python/gpython, which is in pure Go. This re-enable the ability to cross-compile Gazelle from other platforms to macOS.

Note that gpython doesn't preserve Python comments, while Python extension read Gazelle directives in Python comments (not just BUILD.bazel file comments), I have to scan the Python file for a second time to find the comments, but this doesn't include inline comments. So this will break existing usage of Gazelle directives inline Python comments, which I think should be very rare. In fact, unless I miss anything, I don't think we should allow Gazelle directives in Python comments in the first place. They are designed to be in BUILD.bazel files.

Fixes #1913

@alexeagle FYI.

aignas · 2024-10-22T02:53:31Z

FYI, @dougthor42, you might be using the feature of comments in the source code, would you have time to check this PR please?

linzhp · 2024-10-22T04:38:52Z

hmm, this only support Python 3.4 or below... Marking it as draft for now

dougthor42 · 2024-11-13T05:02:22Z

@linzhp Any update on this? TBH it kinda looks like github.com/go-python/gpython is defunct. It hasn't had any updates since Feb.

I've run into another issue with the tree-sitter-based parser (#2396) which is why I'm curious.

I don't think we should allow Gazelle directives in Python comments in the first place.

I would be quite sad if gazelle annotations (comments in python files) were removed. That's the only way we've been able to deal with pre-existing circular imports within our code base when migrating to Bazel.

dougthor42 · 2024-11-13T05:27:10Z

I wonder if it would be easier to just make our own go parser for python? I've already done a bit while working on a hack for #2396.

The key thing to note is that Gazelle only needs to know a very small subset of python grammar:

imports
comments
the if __name__ == "__main__": clause

For the "happy path", comments and if __name__ clause are pretty easy, with imports being a little more difficult but not absurd. Comments are # to end of line, if __name__ has only a small number of permutations that are needed to check. Imports have 4 forms that need to be supported, but it's not that difficult.

Things start to get tricky when:

imports span multiple lines:
```
from foo import (
    bar,
    baz
)
```

Someone puts a gazelle annotation comment in a string or another comment:

"""Make sure to use '# gazelle:ignore foo.bar' because ..."
# but don't use '# gazelle:include_dep foo.bar' because ...

??? other edge cases I'm sure.

For the record, here's some very-WIP pure go parser for the happy path above. This is what I'm playing with to address #2396

// file_parser.go

// A really really dumb python parser that's "good enough". Hopefully.
func (p *FileParser) naiveFileParser(code []byte) (*ParserOutput, error) {

	codeStr := string(code[:])

	// Parse imports
	modules, err := p.naiveImportParser(codeStr)
	if err != nil {
		return nil, err
	}
	p.output.Modules = modules

	// Parse comments
	// TODO: find comment-in-string or comment-in-comment cases
	commentPattern := regexp.MustCompile(`(?m)#.+$`)
	commentStrings := commentPattern.FindAllString(codeStr, -1)
	var comments []comment
	for _, s := range commentStrings {
		comments = append(comments, comment(s))
	}
	p.output.Comments = comments

	// Parse 'if __name__' block
	// TODO: Support other similar forms like b == a or single quotes
	p.output.HasMain = strings.Contains(codeStr, `if __name__ == "__main__":`)
	return &p.output, nil
}

func indexOf(element string, data []string) (uint32, error) {
	for k, v := range data {
		if element == v {
			return uint32(k), nil
		}
	}
	err := errors.New(fmt.Sprintf("Did not find %q in data.", element))
	return 0, err
}

// naiveImportParser runs a very naive, mostly regex-based, parse of the python source code
// as a string.
// NOTE: This does not support multi-line imports like
//    from foo import (
//	bar,
//	baz
//    )
// Nor does this _really_ support comma-separated like
//    from collections.abc import Callable, Iterator
func (p *FileParser) naiveImportParser(code string) ([]module, error) {
	// Find `import a`, `from a import b`, `import a as b`, `from a import b as c`
	// The pattern doesn't need to be bullet-proof; we'll remove things like comments later.
	importPattern := regexp.MustCompile(`(?m)^(\w*import .+)|(\w*from .+ import .+)( as .+)?`)
	importStatements := importPattern.FindAllString(code, -1)

	// Also support Windows files >:-(
	lines := strings.Split(strings.ReplaceAll(code, "\r\n", "\n"), "\n")

	var modules []module
	for _, v := range importStatements {
		lineNum, err := indexOf(strings.TrimSpace(v), lines)
		if err != nil {
			return nil, err
		}

		// Remove any comments that might be present.
		statement, _, _ := strings.Cut(v, "#")
		statement = strings.TrimSpace(statement)

		// At this point we have something like "import foo", "import foo as bar",
		// "from foo import bar", or "from foo import bar as baz"
		// So we need to "do the needful" to make the dotted import 'foo.bar'
		dottedImport := ""
		from := ""

		// We don't care about aliases. Drop them.
		statement, _, _ = strings.Cut(statement, " as ")

		// 'from foo import bar' is the same as 'import foo.bar'. Because we don't need
		// to support 'from foo import some_attribute'.
		// dougthor42 is lucky here. A "real" implementation might need to worry about this.
		if strings.HasPrefix(statement, "from ") {
			first, second, _ := strings.Cut(statement[5:], " import ")

			// second might be a list of items "from foo import A, B, c, d"
			// If that's the case, we're (probably) importing Attributes so we don't
			// need any of them.
			// TODO: The more correct thing to do here would be to make N module structs.
			// But for now, I'm lazy.
			if strings.Contains(second, ",") {
				second = strings.Split(second, ",")[0]
			}

			// "from foo import bar" --> "foo.bar"
			dottedImport = fmt.Sprintf("%s.%s", first, second)
			from = first
		} else {
			// "import foo.bar" --> "foo.bar"
			dottedImport = statement[7:]
		}

		mod := module{
			Name:       dottedImport,
			LineNumber: lineNum,
			Filepath:   p.relFilepath,
			From:       from,
		}
		modules = append(modules, mod)
	}

	return modules, nil
}

linzhp · 2024-11-13T06:23:23Z

Thanks for being interested. My next attempt would be using antlr to parse Python, but I haven't got there yet. Maybe you can give it a try?

I would be quite sad if gazelle annotations (comments in python files) were removed. That's the only way we've been able to deal with pre-existing circular imports within our code base when migrating to Bazel.

Can you use # gazelle:python_ignore_dependencies in BUILD.bazel files instead?

dougthor42 · 2024-11-13T17:38:53Z

My next attempt would be using antlr to parse Python

The readme says that grammar only targets python 3.6 🫤

Can you use # gazelle:python_ignore_dependencies in BUILD.bazel files instead?

Sadly no, as that would apply at the Bazel package level and thus impact all targets. We need target-level (which for us are file-level) ignores # gazelle:ignore foo.bar.baz and includes # gazelle:include_dep //src/some:target.

I see a couple options:

Work on making the go helper cross-compile. Sadly I wouldn't know where to start with this, and I don't have access to a Mac anyway.
Make our own parser in go. This means there's a maintenance burden of keeping it up to date with python grammar changes, but given that we only care about a small subset of grammar, this might not add much overhead.
Revert back to the python-based parser and:
- work on making it hermetic (don't rely on system interpreter, which was the whole point to migrating to a go parser)
- Run it in parallel because the old python one was slow. I'm still a novice with go, but it's possible that goroutines could help here.

Using a pure Go parser for Python

62876b2

linzhp requested a review from f0rmiga as a code owner October 21, 2024 20:18

linzhp marked this pull request as draft October 21, 2024 20:30

linzhp added 2 commits October 21, 2024 18:44

Update WORKSPACE

e29fb93

go mod tidy

a673d21

linzhp marked this pull request as ready for review October 22, 2024 01:57

linzhp marked this pull request as draft October 22, 2024 04:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comments

[fix] Using a pure Go parser for Python#2320

[fix] Using a pure Go parser for Python#2320
linzhp wants to merge 3 commits intobazel-contrib:mainfrom
linzhp:parser

linzhp commented Oct 21, 2024 •

edited

Loading

Uh oh!

aignas commented Oct 22, 2024

Uh oh!

linzhp commented Oct 22, 2024

Uh oh!

dougthor42 commented Nov 13, 2024 •

edited

Loading

Uh oh!

dougthor42 commented Nov 13, 2024

Uh oh!

linzhp commented Nov 13, 2024 •

edited

Loading

Uh oh!

dougthor42 commented Nov 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Comments

Conversation

linzhp commented Oct 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aignas commented Oct 22, 2024

Uh oh!

linzhp commented Oct 22, 2024

Uh oh!

dougthor42 commented Nov 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dougthor42 commented Nov 13, 2024

Uh oh!

linzhp commented Nov 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dougthor42 commented Nov 13, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

linzhp commented Oct 21, 2024 •

edited

Loading

dougthor42 commented Nov 13, 2024 •

edited

Loading

linzhp commented Nov 13, 2024 •

edited

Loading